fix(router): label cascade Dockerfiles so dangling-image cleanup actually matches #1256

Merged
zbigniewsobiecki merged 2 commits into dev from
fix/dangling-image-cleanup-dockerfile-label
May 3, 2026
@zbigniewsobiecki
Member

Summary

  • scanAndCleanupDanglingImages (PR #1243, "fix(router): periodic dangling cascade-image cleanup (close 102GB leak)") has been a no-op since deploy. It filters images by dangling=true AND label=cascade.managed=true, but no cascade-built image carried that label: cascade.managed=true was only ever applied to running containers (in container-manager.ts) and never baked into any Dockerfile.
  • Live verification on the dev host (2026-05-03): 140 dangling images present, but docker images --filter dangling=true --filter label=cascade.managed=true | wc -l returns 0. Every prod cleanup-pass log line shows removedCount=0, reclaimedBytes=0.
  • Adds LABEL cascade.managed=true to all five cascade Dockerfiles. The existing strict label filter starts matching exactly the right set without any code change in the cleanup loop and without any blast-radius widening.
  • Static guard test pins both halves of the contract: the filter shape AND the per-Dockerfile LABEL directive.
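
The mismatch described in the bullets above can be illustrated with a small in-memory model. This is only a sketch: the `ImageModel` shape, `matchesCleanupFilter`, and the sample image ids are hypothetical and are not the code in src/router/dangling-image-cleanup.ts, which queries the Docker daemon rather than an array.

```typescript
// Hypothetical, simplified view of an image as the cleanup filter sees it.
interface ImageModel {
  id: string;
  dangling: boolean;
  labels: Record<string, string>;
}

// Mirrors the strict filter: dangling=true AND label=cascade.managed=true.
function matchesCleanupFilter(image: ImageModel): boolean {
  return image.dangling && image.labels["cascade.managed"] === "true";
}

// Before the fix: the label lived only on containers, so images had none.
const beforeFix: ImageModel[] = [
  { id: "cascade-old-build", dangling: true, labels: {} },
  { id: "unrelated-workload", dangling: true, labels: {} },
];

// After the fix: the Dockerfile LABEL is baked into cascade-built images.
const afterFix: ImageModel[] = [
  { id: "cascade-old-build", dangling: true, labels: { "cascade.managed": "true" } },
  { id: "unrelated-workload", dangling: true, labels: {} },
];

const matchedBefore = beforeFix.filter(matchesCleanupFilter).map((i) => i.id);
const matchedAfter = afterFix.filter(matchesCleanupFilter).map((i) => i.id);

console.log(matchedBefore); // empty: nothing matches, so the loop removes nothing
console.log(matchedAfter); // only the cascade-built image is eligible
```

In this model the unrelated image is never selected in either case, which is the property the strict filter is meant to preserve.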

Why the strict filter stays strict

The cascade-router host runs other unrelated workloads (ucho-dev/prod, MySQL, Loki, etc.). Widening the filter would risk reaping their dangling images. The right fix is to make cascade-built images carry the label so the existing safety belt actually matches them.
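
For reference, the two-clause filter could be expressed against the Docker Engine API roughly as follows. The `filters` query parameter of `GET /images/json` is a JSON-encoded map from filter name to a list of values, and different filter names are combined with AND. The helper names below are hypothetical, and this sketch makes no claim about whether the router uses a client library or the raw HTTP API.

```typescript
// Docker Engine API filters are a map of filter name -> list of values.
type DockerFilters = Record<string, string[]>;

// Hypothetical helper: the strict filter described in this PR.
function cascadeDanglingFilters(): DockerFilters {
  return {
    dangling: ["true"],
    label: ["cascade.managed=true"],
  };
}

// Hypothetical helper: encode the filters as a URL query string.
function toFiltersQuery(filters: DockerFilters): string {
  return `filters=${encodeURIComponent(JSON.stringify(filters))}`;
}

const query = toFiltersQuery(cascadeDanglingFilters());
console.log(`/images/json?${query}`);
```

Dropping the `label` entry from this map is what "widening the filter" would amount to: every dangling image on the host would then be a candidate.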

Out of scope

  • Pre-label dangling backlog (~130 images on prod) needs a one-off manual prune after deploy:
    docker image prune --filter "until=24h"
    
    Not automated — would need a separate "garbage-collect pre-label dangling images" loop with stricter age + size filters, different invariants, different review surface.
  • Cascade-worker tag bloat (29 SHA-pinned tags accumulating on the registry, ~3.81GB tagged each but layer-shared). Separate retention loop, separate PR.
  • Buildkit cache (15.5GB on dev). Buildkit's own gc is the right primitive; not a cascade-router responsibility.

Test plan

  • npx vitest run --project unit-api tests/unit/router/dangling-image-cleanup.test.ts — green (20 tests, 5 new LABEL guards)
  • npm test — full unit suite (8781 tests, 23 skipped, 0 failures)
  • npm run typecheck — clean
  • npm run lint — clean (13 pre-existing warnings, unrelated)
  • After merge to main + first cascade-worker rebuild: confirm Loki shows [DanglingImageCleanup] Cleanup pass complete: { removedCount: N>0, reclaimedBytes: M>0 } within 30 min
  • Run one-off docker image prune --filter "until=24h" on prod to clear the pre-label backlog

🤖 Generated with Claude Code

…ally matches

`scanAndCleanupDanglingImages` (PR #1243) filters images by `dangling=true
AND label=cascade.managed=true`. The label clause is the safety belt that
keeps the loop from reaping unrelated host workloads (ucho-dev/prod, MySQL,
Loki, etc.) — but the label was never applied to cascade-built images.
`cascade.managed=true` was only set as a CONTAINER label at run time
(container-manager.ts), never as an IMAGE label via a `LABEL` directive in
any Dockerfile.

Live verification on the dev host: 140 dangling images present, but
`docker images --filter dangling=true --filter label=cascade.managed=true`
returns zero. Every Loki cleanup-pass log line shows
`removedCount=0, reclaimedBytes=0` — the loop has been a no-op since deploy.

Adds `LABEL cascade.managed=true` to all five cascade Dockerfiles (router,
worker, dashboard, frontend, selfhosted) so newly-built images carry the
label, dangling rebuilds inherit it, and the existing strict filter starts
matching exactly the right set. No code change in the cleanup loop. No
filter widening.

Static guard test pins both halves of the contract: the filter shape AND
the per-Dockerfile LABEL directive. A new `Dockerfile.<svc>` without the
label fails CI loud.
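
A minimal sketch of the per-Dockerfile half of such a guard, assuming the Dockerfile contents are already loaded into a map. The function names and the sample file contents are hypothetical; the actual test lives in tests/unit/router/dangling-image-cleanup.test.ts and is not reproduced here.

```typescript
// Hypothetical check: does any LABEL instruction set cascade.managed=true?
// Comment lines start with "#", so their first token is never "LABEL".
function hasManagedLabel(dockerfile: string): boolean {
  return dockerfile.split(/\r?\n/).some((line) => {
    const tokens = line.trim().split(/\s+/);
    const instruction = tokens[0] ?? "";
    if (instruction.toUpperCase() !== "LABEL") return false;
    return tokens
      .slice(1)
      .some((t) => t === "cascade.managed=true" || t === 'cascade.managed="true"');
  });
}

// Hypothetical helper: names of Dockerfiles that lack the label.
function dockerfilesMissingLabel(files: Record<string, string>): string[] {
  return Object.entries(files)
    .filter(([, content]) => !hasManagedLabel(content))
    .map(([name]) => name)
    .sort();
}

// Illustrative inputs only; these are not the repository's real Dockerfiles.
const sampleFiles: Record<string, string> = {
  "Dockerfile.router": "FROM node:22\nLABEL cascade.managed=true\n",
  "Dockerfile.newsvc": "FROM node:22\n# LABEL cascade.managed=true\n",
};

const missing = dockerfilesMissingLabel(sampleFiles);
console.log(missing);
```

A guard test would then assert that the returned list is empty. This sketch does not handle LABEL instructions split across lines with a trailing backslash.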

Pre-label dangling backlog (~130 images on prod) needs a one-off manual
prune; documented in PR body. Out of scope: cascade-worker tag bloat (29
SHA-pinned tags accumulating) — separate retention loop, separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@nhopeatall nhopeatall left a comment

test review

🕵️ claude-code · claude-opus-4-6 · run details

Collaborator

@nhopeatall nhopeatall left a comment

Summary

Clean, well-scoped fix for a real production issue — the dangling-image cleanup loop was a complete no-op since deploy because no built image carried the cascade.managed=true label it filters on. The fix correctly adds LABEL cascade.managed=true to the production stage of all five Dockerfiles, and the static guard test pins both halves of the contract going forward.

Code Issues

Should Fix

  • All five Dockerfiles — The comment block references PR #12xx (a placeholder) instead of the actual PR #1256. This should be updated for traceability — when someone reads these comments in 6 months, #12xx won't resolve to anything useful.

🕵️ claude-code · claude-opus-4-6 · run details

Comment thread Dockerfile.dashboard Outdated

# `cascade.managed=true` is the contract the router's dangling-image cleanup
# loop filters on (src/router/dangling-image-cleanup.ts). Without this LABEL,
# the loop matches zero images and reclaims nothing — see PR #12xx.
Collaborator

[SHOULD_FIX] Placeholder PR #12xx should be PR #1256. Same placeholder appears in all five Dockerfiles — worth a quick find-and-replace before merge.

Collaborator

Fixed! Replaced the PR #12xx placeholder with the actual PR #1256 across all five Dockerfiles.

Collaborator

Fixed! Replaced the PR #12xx placeholder with PR #1256 across all five Dockerfiles. The commit is at e4192ce.

@codecov
codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit c7540ee into dev May 3, 2026
9 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/dangling-image-cleanup-dockerfile-label branch May 3, 2026 07:48